537 research outputs found

    Estimation of intrinsic dimension via clustering

    The problem of estimating the intrinsic dimension of a set of points in high dimensional space is a critical issue for a wide range of disciplines, including genomics, finance, and networking. Current estimation techniques are dependent on either the ambient or intrinsic dimension in terms of computational complexity, which may cause these methods to become intractable for large data sets. In this paper, we present a clustering-based methodology that exploits the inherent self-similarity of data to efficiently estimate the intrinsic dimension of a set of points. When the data satisfies a specified general clustering condition, we prove that the estimated dimension approaches the true Hausdorff dimension. Experiments show that the clustering-based approach allows for more efficient and accurate intrinsic dimension estimation compared with all prior techniques, even when the data does not conform to obvious self-similarity structure. Finally, we present empirical results which show the clustering-based estimation allows for a natural partitioning of the data points that lie on separate manifolds of varying intrinsic dimension

    Matroid Bandits: Fast Combinatorial Optimization with Learning

    A matroid is a notion of independence in combinatorial optimization which is closely related to computational efficiency. In particular, it is well known that the maximum of a constrained modular function can be found greedily if and only if the constraints are associated with a matroid. In this paper, we bring together the ideas of bandits and matroids, and propose a new class of combinatorial bandits, matroid bandits. The objective in these problems is to learn how to maximize a modular function on a matroid. This function is stochastic and initially unknown. We propose a practical algorithm for solving our problem, Optimistic Matroid Maximization (OMM); and prove two upper bounds, gap-dependent and gap-free, on its regret. Both bounds are sublinear in time and at most linear in all other quantities of interest. The gap-dependent upper bound is tight and we prove a matching lower bound on a partition matroid bandit. Finally, we evaluate our method on three real-world problems and show that it is practical
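The greedy property mentioned in this abstract can be illustrated with a minimal sketch. It assumes the item weights are already known and non-negative, so it shows only the greedy step on a matroid, not the learning algorithm (OMM) that the paper proposes. The names `greedy_max_modular`, `one_per_part`, `PART_OF`, and `WEIGHTS` are hypothetical and chosen for illustration; the example constraint is a small partition matroid allowing at most one item per part.

```python
from typing import Callable, Dict, FrozenSet, List


def greedy_max_modular(
    weights: Dict[str, float],
    is_independent: Callable[[FrozenSet[str]], bool],
) -> List[str]:
    """Greedily build an independent set of maximum total weight.

    Assumes non-negative weights and that `is_independent` is the
    independence oracle of a matroid (hypothetical interface).
    """
    chosen: List[str] = []
    # Visit items from heaviest to lightest.
    for item in sorted(weights, key=lambda e: weights[e], reverse=True):
        # Keep the item only if the set stays independent.
        if is_independent(frozenset(chosen) | {item}):
            chosen.append(item)
    return chosen


# Illustrative partition matroid: each item belongs to one part,
# and an independent set holds at most one item per part.
PART_OF = {"a": 0, "b": 0, "c": 1, "d": 1}


def one_per_part(items: FrozenSet[str]) -> bool:
    parts = [PART_OF[e] for e in items]
    return len(parts) == len(set(parts))


# Made-up weights for the example.
WEIGHTS = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.7}

selection = greedy_max_modular(WEIGHTS, one_per_part)
print(selection)  # ['a', 'd']
```

In the bandit setting the weights are stochastic and initially unknown, so a learning method has to work from estimates rather than the true values used here.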

    Towards a Deeper Understanding of Agricultural Production Systems in Sweden – Linking Farmer’s Logics with Environmental Consequences and the Landscape

    Farm restructuring is a continuous on-going process supported by national agricultural policy in Sweden; while striving for more efficient farms in terms of labor and yields, farms enlarge their holdings of arable land and animals. The environmental consequences of more intensive land uses have in turn stimulated environmental policies to deal with negative environmental consequences. In this paper we argue that an underlying problem with both of these policy approaches is that they primarily emphasize specific components of farms and fail to see the farm as an interconnected system. In this paper we therefore focus on the farm as a ‘system’ and on the systemic role of farming in the broader landscape. We develop a theoretical framework of farming logics which help to better understand agricultural production systems. Drawing on 34 semi-structured interviews with farmers, we divide the farms into three farming logic categories: I) ‘production vanguards’; II) ‘landscape stewards’; and III) ‘environmental vanguards’. We use these categories to analyze the role of key aspects such as size, intensity of production, specialisation, how farmer preferences and knowledge influence land use systems, and interactions of these with the local landscape. The findings show how farms that on the one hand share some basic characteristics can display quite different farming logics and vice versa. We argue that these farming logics offer a potentially positive diversity in farming approaches, with complementary and mutually dependent roles in Sweden’s overall food system

    Active Clustering: Robust and Efficient Hierarchical Clustering using Adaptively Selected Similarities

    Hierarchical clustering based on pairwise similarities is a common tool used in a broad range of scientific applications. However, in many problems it may be expensive to obtain or compute similarities between the items to be clustered. This paper investigates the hierarchical clustering of N items based on a small subset of pairwise similarities, significantly less than the complete set of N(N-1)/2 similarities. First, we show that if the intracluster similarities exceed intercluster similarities, then it is possible to correctly determine the hierarchical clustering from as few as 3N log N similarities. We demonstrate this order of magnitude savings in the number of pairwise similarities necessitates sequentially selecting which similarities to obtain in an adaptive fashion, rather than picking them at random. We then propose an active clustering method that is robust to a limited fraction of anomalous similarities, and show how even in the presence of these noisy similarity values we can resolve the hierarchical clustering using only O(N log^2 N) pairwise similarities
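To give a feel for the savings claimed in this abstract, the following sketch compares the complete set of N(N-1)/2 pairwise similarities with a budget of 3N log N. The abstract does not state the base of the logarithm, so base 2 is an assumption here, and the function names `full_pair_count` and `adaptive_budget` are hypothetical. The O(N log^2 N) bound for the robust method is left out because its constant is not given.

```python
import math


def full_pair_count(n: int) -> int:
    """Number of distinct pairs among n items: n(n-1)/2."""
    return n * (n - 1) // 2


def adaptive_budget(n: int) -> int:
    """3 n log n, rounded up (log base 2 is an assumption)."""
    return math.ceil(3 * n * math.log2(n))


# Print both counts and their ratio for a few problem sizes.
for n in (100, 1000, 10000):
    full = full_pair_count(n)
    budget = adaptive_budget(n)
    print(f"N={n}: all pairs={full}, 3N log N={budget}, ratio={full / budget:.1f}")
```

Under this assumption the budget is smaller than the full pair count only once N is moderately large; for very small N (for example N = 10) the expression 3N log N exceeds the total number of pairs.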

    Deep Unsupervised Clustering Using Mixture of Autoencoders

    Unsupervised clustering is one of the most fundamental challenges in machine learning. A popular hypothesis is that data are generated from a union of low-dimensional nonlinear manifolds; thus an approach to clustering is identifying and separating these manifolds. In this paper, we present a novel approach to solve this problem by using a mixture of autoencoders. Our model consists of two parts: 1) a collection of autoencoders where each autoencoder learns the underlying manifold of a group of similar objects, and 2) a mixture assignment neural network, which takes the concatenated latent vectors from the autoencoders as input and infers the distribution over clusters. By jointly optimizing the two parts, we simultaneously assign data to clusters and learn the underlying manifolds of each cluster. Part of this work was done when Dejiao Zhang was doing an internship at Technicolor Research. Both Dejiao Zhang and Laura Balzano’s participation was funded by DARPA-16-43-D3M-FP-037. Both Yifan Sun and Brian Eriksson's participation occurred while also at Technicolor Research. https://deepblue.lib.umich.edu/bitstream/2027.42/145190/1/mixae_arxiv_submit.pdf

    Efficient Replication of Over 180 Genetic Associations with Self-Reported Medical Data

    While the cost and speed of generating genomic data have come down dramatically in recent years, the slow pace of collecting medical data for large cohorts continues to hamper genetic research. Here we evaluate a novel online framework for amassing large amounts of medical information in a recontactable cohort by assessing our ability to replicate genetic associations using these data. Using web-based questionnaires, we gathered self-reported data on 50 medical phenotypes from a generally unselected cohort of over 20,000 genotyped individuals. Of a list of genetic associations curated by NHGRI, we successfully replicated about 75% of the associations that we expected to (based on the number of cases in our cohort and reported odds ratios, and excluding a set of associations with contradictory published evidence). Altogether we replicated over 180 previously reported associations, including many for type 2 diabetes, prostate cancer, cholesterol levels, and multiple sclerosis. We found significant variation across categories of conditions in the percentage of expected associations that we were able to replicate, which may reflect systematic inflation of the effects in some initial reports, or differences across diseases in the likelihood of misdiagnosis or misreport. We also demonstrated that we could improve replication success by taking advantage of our recontactable cohort, offering more in-depth questions to refine self-reported diagnoses. Our data suggests that online collection of self-reported data in a recontactable cohort may be a viable method for both broad and deep phenotyping in large populations